12. PPO summary

So that’s it! We can finally summarize the PPO algorithm (a minimal code sketch follows the list below):

  1. First, collect some trajectories based on the current policy \pi_\theta, and initialize \theta'=\theta
  2. Next, compute the gradient of the clipped surrogate function using the trajectories
  3. Update \theta' using gradient ascent \theta'\leftarrow\theta' +\alpha \nabla_{\theta'}L_{\rm sur}^{\rm clip}(\theta', \theta)
  4. Then repeat steps 2-3 without generating new trajectories. Typically, steps 2-3 are only repeated a few times
  5. Set \theta=\theta', go back to step 1, repeat.
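Here is a minimal sketch of this loop in PyTorch. It assumes a small discrete-action policy network; the `collect_trajectories` helper, the dummy random rollout data inside it, and hyperparameters such as `epsilon=0.2`, the learning rate, and the number of reuse passes are illustrative assumptions, not values from the lesson.

```python
# Minimal PPO loop sketch (assumptions noted above); rollouts are faked with random data.
import torch
import torch.nn as nn
import torch.optim as optim

def collect_trajectories(policy, n_steps=128, obs_dim=4):
    """Stand-in for real environment rollouts: returns states, actions,
    future rewards, and log-probabilities under the current policy pi_theta."""
    states = torch.randn(n_steps, obs_dim)
    with torch.no_grad():
        dist = torch.distributions.Categorical(policy(states))
        actions = dist.sample()
        old_log_probs = dist.log_prob(actions)
    future_rewards = torch.randn(n_steps)  # placeholder for discounted future rewards
    return states, actions, future_rewards, old_log_probs

def clipped_surrogate(policy, states, actions, future_rewards,
                      old_log_probs, epsilon=0.2):
    """L_sur^clip: mean of min(ratio * R, clip(ratio, 1-eps, 1+eps) * R)."""
    new_log_probs = torch.distributions.Categorical(policy(states)).log_prob(actions)
    ratio = torch.exp(new_log_probs - old_log_probs)       # pi_theta' / pi_theta
    clipped = torch.clamp(ratio, 1 - epsilon, 1 + epsilon)
    return torch.min(ratio * future_rewards, clipped * future_rewards).mean()

policy = nn.Sequential(nn.Linear(4, 32), nn.ReLU(),
                       nn.Linear(32, 2), nn.Softmax(dim=-1))
optimizer = optim.Adam(policy.parameters(), lr=1e-3)

for episode in range(3):                                   # step 1: collect trajectories
    states, actions, rewards, old_log_probs = collect_trajectories(policy)
    for _ in range(4):                                      # steps 2-4: reuse the same data
        loss = -clipped_surrogate(policy, states, actions,
                                  rewards, old_log_probs)   # ascent = minimize -L_sur^clip
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    # step 5: theta <- theta' happens implicitly, because old_log_probs are
    # recomputed from the updated policy on the next collection pass
```

Note that the same network plays both roles: the detached `old_log_probs` act as the frozen \pi_\theta, while the network being updated acts as \pi_{\theta'}.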

The details of PPO were originally published by the team at OpenAI, and you can read their paper through this link